Large and heterogeneous datasets may contain thousands of records missing spatial or taxonomic information (partially or entirely), as well as records outside a region of interest or from doubtful sources. Such lower-quality data are not fit for use in many research applications without prior amendments. The ‘Pre-filter’ step contains a series of tests to detect, remove, and, whenever possible, correct such erroneous or suspect records.
Important:
The result of each VALIDATION test used to flag data quality is appended as a separate field in the database, holding TRUE or FALSE: TRUE indicates a correct record, and FALSE a potentially problematic or suspect record.
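A toy illustration of this convention (not package output; the data and column values are made up). Each VALIDATION test appends a logical column whose name starts with “.”:

```r
# Hypothetical occurrence table with one VALIDATION flag column appended.
# TRUE marks a correct record, FALSE a suspect one.
flags <- data.frame(
  database_id           = c("gbif_1", "gbif_2", "gbif_3"),
  .scientificName_empty = c(TRUE, TRUE, FALSE)
)

# Count suspect records for this test
sum(!flags$.scientificName_empty)
```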
You can install the released version of ‘bdc’ from GitHub with:
if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

Creating folders to save the results
Read the merged database created in the step Standardization and integration of different datasets of the BDC workflow. It is also possible to read any datasets containing the required fields to run the workflow.
Standardization of character encoding
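The vignette does not show code for this step; `bdc` handles encoding internally. As a minimal base-R illustration of the idea (not the `bdc` internals), `iconv()` can convert text between encodings so that accented characters in names and localities are stored consistently as UTF-8:

```r
# Simulate a Latin-1 encoded string (as often found in legacy datasets),
# then standardize it to UTF-8.
latin1 <- iconv("S\u00e3o Paulo", from = "UTF-8", to = "latin1")
utf8   <- iconv(latin1, from = "latin1", to = "UTF-8")
utf8
```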
VALIDATION. This test flags records missing species names.
VALIDATION. This test flags records missing partial or complete information on geographic coordinates.
VALIDATION. This test flags records with out-of-range coordinates, that is, latitude > 90 or < -90, or longitude > 180 or < -180.
VALIDATION. This test flags records from doubtful sources, for example, records based on drawings, photographs, or multimedia objects, and fossil records, among others.
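The logic behind these four tests can be sketched in base R on a toy data frame (the actual `bdc` functions, `bdc_scientificName_empty()`, `bdc_coordinates_empty()`, `bdc_coordinates_outOfRange()`, and `bdc_basisOfRecords_notStandard()`, implement them with more care; the data and the list of “doubtful” sources below are assumptions for illustration only):

```r
# Toy occurrence table (made-up data)
occ <- data.frame(
  scientificName   = c("Panthera onca", NA, "Tapirus terrestris", "Ara ararauna"),
  decimalLatitude  = c(-3.1, -10.2, 95.0, -15.8),
  decimalLongitude = c(-60.0, NA, -47.9, -47.9),
  basisOfRecord    = c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION",
                       "PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN")
)

# TRUE = correct record, FALSE = flagged record
occ$.scientificName_empty <- !is.na(occ$scientificName) &
  trimws(occ$scientificName) != ""

occ$.coordinates_empty <- !is.na(occ$decimalLatitude) &
  !is.na(occ$decimalLongitude)

# Missing coordinates pass here; they are handled by the test above
occ$.coordinates_outOfRange <- is.na(occ$decimalLatitude) |
  is.na(occ$decimalLongitude) |
  (abs(occ$decimalLatitude) <= 90 & abs(occ$decimalLongitude) <= 180)

# Sources treated as doubtful in this sketch
occ$.basisOfRecords_notStandard <- !occ$basisOfRecord %in% c("FOSSIL_SPECIMEN")
```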
ENRICHMENT. Deriving country names for records missing them.
ENRICHMENT. Country names are standardized against a list of country names in several languages retrieved from Wikipedia.
check_pf <- bdc_country_standardized(
data = check_pf,
country = "country"
)
#> Loading auxiliary data: country names from wikipedia
#> Loading auxiliary data: world map and country iso
#> Standardizing country names
#> country found: Argentina
#> country found: Belize
#> country found: Bolivia
#> country found: Brazil
#> country found: Colombia
#> country found: Ecuador
#> country found: France
#> country found: French Guiana
#> country found: Guyana
#> country found: Honduras
#> country found: Japan
#> country found: Mexico
#> country found: Nicaragua
#> country found: Paraguay
#> country found: Suriname
#> country found: Uruguay
#> country found: Venezuela
#>
#> bdc_country_standardized:
#> The country names of 8540 records were standardized.
#> Two columns were added to the database.

AMENDMENT. A mismatch between the informed country and the coordinates can result from negated or transposed coordinates. Once a mismatch is detected, different coordinate transformations are applied to correct it. Verbatim coordinates are then replaced by the corrected ones in the returned database (a database containing both verbatim and corrected coordinates is also created in the “Output” folder).
check_pf <-
bdc_coordinates_transposed(
data = check_pf,
id = "database_id",
sci_names = "scientificName",
lat = "decimalLatitude",
lon = "decimalLongitude",
country = "country",
countryCode = "countryCode",
border_buffer = 0.2 # in decimal degrees (~22 km at the equator)
)
#> Correcting latitude and longitude transposed
#> Testing coordinate validity
#> Removed 1522 records.
#> Testing coordinate validity
#> Flagged 0 records.
#> Testing sea coordinates
#> Flagged 704 records.
#> Testing country identity
#> Flagged 716 records.
#> Flagged 716 of 7018 records, EQ = 0.1.
#> 716 ocurrences will be tested
#> Processing occurrences from: BR (713)
#> Processing occurrences from: CO (1)
#> Processing occurrences from: MX (1)
#> Processing occurrences from: VE (1)
#>
#> bdc_coordinates_transposed:
#> Corrected 19 records.
#> One columns were added to the database.
#> Check database containing coordinates corrected in:
#> Output/Check/01_coordinates_transposed.csv

VALIDATION. This test flags records outside one or multiple reference countries, i.e., records in other countries or located farther from the coast than an informed distance (e.g., in the ocean). The distance buffer avoids flagging as invalid those records close to country limits (e.g., records of coastal or marshland species).
check_pf <-
bdc_coordinates_country_inconsistent(
data = check_pf,
country_name = "Brazil",
lon = "decimalLongitude",
lat = "decimalLatitude",
dist = 0.1 # in decimal degrees (~11 km at the equator)
)
#> dist is assumed to be in decimal degrees (arc_degrees).
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#>
#> bdc_coordinates_country_inconsistent:
#> Flagged 658 records.
#> One column was added to the database.

ENRICHMENT. Coordinates can be derived from a detailed description of the locality associated with a record, a process called retrospective georeferencing.
xyFromLocality <- bdc_coordinates_from_locality(
data = check_pf,
locality = "locality",
lon = "decimalLongitude",
lat = "decimalLatitude"
)
#>
#> bdc_coordinates_from_locality
#> Found 1944 records missing or with invalid coordinates but with potentially useful information on locality.
#>
#> Check database in: C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Check/01_coordinates_from_locality.csv

Creating a column named “.summary” summarizing the results of all VALIDATION tests. This column is “FALSE” if any test was flagged “FALSE” (i.e., a potentially invalid or suspect record).
check_pf <- bdc_summary_col(data = check_pf)
#>
#> bdc_summary_col:
#> Flagged 2888 records.
#> One column was added to the database.

Creating a report summarizing the results of all tests.
report <-
bdc_create_report(data = check_pf,
database_id = "database_id",
workflow_step = "prefilter")
#>
#> bdc_create_report:
#> Check the report summarizing the results of the prefilter in:
#> Output/Report
report

| Description | Test_name | Records_flagged | perc_number_records(*) |
|---|---|---|---|
| Records with empty scientific name | .scientificName_empty | 324 | 3.6 |
| Records with empty coordinates | .coordinates_empty | 1921 | 21.34 |
| Records with out-of-range coordinates | .coordinates_outOfRange | 23 | 0.26 |
| Records from doubtful sources | .basisOfRecords_notStandard | 5 | 0.06 |
| Records outside one or multiple reference countries | .coordinates_country_inconsistent | 658 | 7.31 |
| Summary of all tests | .summary | 2888 | 32.09 |

(*) Calculated in relation to the total number of records, i.e., 9000 records.
Creating figures (bar plots and maps) to facilitate the interpretation of the results of data quality tests.
bdc_create_figures(data = check_pf,
database_id = "database_id",
workflow_step = "prefilter")
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures

Transposed coordinates
Coordinates and country inconsistent
Summary of all tests
It is possible to remove the flagged records (potentially problematic ones) to get a ‘clean’ database (i.e., one without the test columns starting with “.”). However, to ensure that all records are evaluated by all data quality tests (i.e., the tests of the taxonomic, spatial, and temporal steps of the workflow), potentially erroneous or suspect records are removed only in the final step of the workflow.
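What “cleaning” means here can be sketched in base R (a toy data frame, not the actual database; in `bdc` the equivalent is filtering on `.summary` and dropping the test columns): keep the rows whose `.summary` is TRUE and remove every column whose name starts with “.”.

```r
# Toy flagged table (made-up data)
occ <- data.frame(
  scientificName     = c("Panthera onca", "Ara ararauna", "Tapirus terrestris"),
  .coordinates_empty = c(TRUE, FALSE, TRUE),
  .summary           = c(TRUE, FALSE, TRUE)
)

# Keep passing rows, drop the "." test columns
clean <- occ[occ$.summary, !startsWith(names(occ), "."), drop = FALSE]
clean
```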